author: Sara Altman id: tidymodel-prediction-workflows-inside-snowflake-with-orbital summary: Tidymodel Prediction Workflows Inside Snowflake with Orbital categories: Data-Science environments: web status: Published feedback link: https://github.com/Snowflake-Labs/sfguides/issues tags: Data Science, R, Posit Workbench, Native Applications
Duration: 1
This Quickstart will demonstrate how to run model predictions inside Snowflake using the orbital R package and the Snowflake Native Posit Workbench App.
With the orbital package, you can:
We’ll use loan data from LendingClub to build an example model in this Quickstart.
accountadmin role.accountadmin role or role
that allows you to:
You can follow along with this Quickstart guide, or look at the materials provided in the accompanying repository: https://github.com/posit-dev/snowflake-posit-quickstart-orbital.
Duration: 5
Before we begin there are a few components we need to prepare. We need to:
For this analysis, we’ll use loan data from LendingClub. We’ve made the data available in an S3 bucket.
The data contains information for about 2.3 million loans from Southern US states.
In Snowsight, click Create >
SQL Worksheet. Copy this
SQL code and paste it into the worksheet. Run the code to create the
database and table you’ll need for this Quickstart.
aside negative
You may need to change the role granted usage from
SYSADMINto your desired role.
After running the code, you should be able to see the Lending Club data in Snowsight.
Navigate to Data > Databases and select
the database where you added the data (e.g., LENDING_CLUB).
Expand the database, schema, and tables until you see the
LOAN_DATA table.
We can now start the modeling process with the data using the Posit Workbench Native App. The Posit Workbench Native App allows users to develop in their preferred IDE in Posit Workbench with their Snowflake data, all while adhering to Snowflake’s security and governance protocols.
Once Posit Workbench is installed, click on the app under
Installed Apps. If you do not see the Posit Workbench app
listed, ask your Snowflake account administrator for access to the
app.
After clicking on the app, you will see a page with configuration
instructions and a blue Launch app button.
Click on Launch app to launch the app. You may be
prompted to first login to Snowflake using your regular credentials or
authentication method.
Posit Workbench provides several IDEs, such as RStudio Pro, JupyterLab, and VS Code. For this Quickstart, we will use RStudio Pro.
Within Posit Workbench, click New Session to launch a
new session.
When prompted, select RStudio Pro.
Next, connect to your Snowflake account from within Posit Workbench.
Under Session Credentials, click the button with the
Snowflake icon to sign in to Snowflake. Follow the sign in prompts.
When you’re successfully signed in to Snowflake, the Snowflake button will turn blue and there will be a check mark in the upper-left corner.
Click Start Session to launch the RStudio Pro IDE.
Once everything is ready, you will be able to work with your Snowflake data in the familiar RStudio Pro IDE. Since the IDE is provided by the Posit Workbench Native App, your entire analysis will run securely within Snowflake.
This Quickstart will walk you through the code contained in https://github.com/posit-dev/snowflake-posit-quickstart-orbital/blob/main/fit_and_deploy.R. To follow along, open the file in your RStudio Pro IDE. There are two ways to do this:
Simple copy-and-paste Go to File > New File > R Script and then copy the contents of fit_and_deploy.R into your new file.
Starting a new project linked to the GitHub repo. To do this:
File > New Project in the RStudio
IDE menu bar.RStudio will clone a local copy of the materials on GitHub. You can
use the Files pane in the bottom right-hand corner of the IDE to
navigate to quarto.qmd. Click on the file to open
it.
aside negative
SSH authentication is not available in Snowpark Container Services, so when creating projects from Git, you may need to authenticate Git operations over HTTPS, using a username and password or a personal access token.
Now that we’re in a familiar R environment, we need to install the necessary packages. We’ll use the tidymodels ecosystem of packages, as well as a few others.
install.packages(
c(
"odbc",
"DBI",
"dbplyr",
"dplyr",
"glue",
"arrow",
"stringr",
"tidymodels",
"vetiver",
"rsconnect",
"pins",
"orbital",
"tidypredict",
"ggplot2"
)
)
After we install the packages, we load them.
library(odbc)
library(DBI)
library(dbplyr)
library(dplyr)
library(glue)
library(arrow)
library(stringr)
library(tidymodels)
library(vetiver)
library(rsconnect)
library(pins)
library(orbital)
library(tidypredict)
library(ggplot2)
Duration: 5
Before starting the modeling process, we need to connect to our database and load the loan data.
We’ll use the DBI and odbc R packages to connect to the database. We’ll then use dplyr and dbplyr to query the data with R without having to write raw SQL. To learn more, see Analyze Data with R using Posit Workbench and Snowflake.
First, we use DBI::dbConnect() to connect to Snowflake.
We’ll also need a driver provided by the odbc package. The
warehouse, database, and schema
arguments specify our desired Snowflake warehouse, database, and
schema.
con <- dbConnect(
odbc::snowflake(),
warehouse = "DEFAULT_WH",
database = "LENDING_CLUB",
schema = "ML"
)
aside negative
You may need to change
warehouse,database, andschemato match your environment.
con now stores our connection.
Once connected, we can view available databases, schemas, and tables in the RStudio IDE Connections pane. Click on the database icon to the right of a database to see its schemas. Click on the schema icon to the right of a schema to see its tables. Click the table icon to preview the table.
tblOnce we build a connection, we can use dplyr::tbl() to
create tbls. A tbl is an R object that represents a table
or view accessed through a connection.
con |> tbl("LOAN_DATA")
We can also use dbplyr to translate typical dplyr verbs into SQL. To
prepare our tbl for modeling, we’ll extract the year and
month information from the ISSUE_D column.
lendingclub_dat <-
con |>
tbl("LOAN_DATA") |>
mutate(
ISSUE_YEAR = as.integer(str_sub(ISSUE_D, start = 5)),
ISSUE_MONTH = as.integer(
case_match(
str_sub(ISSUE_D, end = 3),
"Jan" ~ 1,
"Feb" ~ 2,
"Mar" ~ 3,
"Apr" ~ 4,
"May" ~ 5,
"Jun" ~ 6,
"Jul" ~ 7,
"Aug" ~ 8,
"Sep" ~ 9,
"Oct" ~ 10,
"Nov" ~ 11,
"Dec" ~ 12
)
)
)
We don’t want to fit our model on all 2.3 million rows, so we’ll filter to a single year and then sample 5,000 rows.
lendingclub_sample <-
lendingclub_dat |>
filter(ISSUE_YEAR == 2016) |>
slice_sample(n = 5000)
Our ultimate goal is to create a model that predicts loan interest
rates (the INT_RATE column). To prepare our data for
modeling, we’ll first select a few columns of interest (loan term,
credit utilization, credit open-to-buy, and all utilization), convert
INT_RATE to a numeric variable, and remove missing
values.
aside positive Typically, we’d find our columns of interest with exploratory data analysis techniques, but for the purposes of this Quickstart, we’ve skipped this step.
lendingclub_prep <-
lendingclub_sample |>
select(INT_RATE, TERM, BC_UTIL, BC_OPEN_TO_BUY, ALL_UTIL) |>
mutate(INT_RATE = as.numeric(str_remove(INT_RATE, "%"))) |>
filter(!if_any(everything(), is.na)) |>
collect()
collect() executes a query and returns the
entire result as a tibble, so lendingclub_prep now
contains our entire filtered sample.
Duration: 10
Next, we’ll create a tidymodels workflow. Workflows bundle pre-processing, modeling, and post-processing steps. Learn more about workflows here.
The first step is to specify our model formula and pre-processing steps using the recipes package.
# Pre-processing
lendingclub_rec <-
recipe(INT_RATE ~ ., data = lendingclub_prep) |>
step_mutate(TERM = (TERM == "60 months")) |>
step_mutate(across(!TERM, as.numeric)) |>
step_normalize(all_numeric_predictors()) |>
step_impute_mean(all_of(c("BC_OPEN_TO_BUY", "BC_UTIL"))) |>
step_filter(!if_any(everything(), is.na))
Our recipe lendingclub_rec defines our model formula:
we’ll use all the other variables in the dataset to predict the interest
rate. We also perform a few pre-processing steps, including turning
TERM into a logical, normalizing all numeric predictors,
and removing missing values.
Next, we define the type of model we want to fit: a linear model.
linear_reg() is from the parsnip package.
lendingclub_lr <- linear_reg()
Now, we can create our workflow, adding our linear model and pre-processing recipe.
lendingclub_wf <-
workflow() |>
add_model(lendingclub_lr) |>
add_recipe(lendingclub_rec)
Our workflow object now looks like this:
lendingclub_wf
══ Workflow ═══════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_mutate()
• step_mutate()
• step_normalize()
• step_impute_mean()
• step_filter()
── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Notice that it includes our pre-processing steps and model specification.
Now that we have our workflow object lendingclub_wf, we
can use it to fit our model.
lendingclub_fit <-
lendingclub_wf |>
fit(data = lendingclub_prep)
To understand how well our model performs, we define a metric set
using yardstick::metric_set() and then compute those
metrics for lendingclub_fit.
lendingclub_metric_set <- metric_set(rmse, mae, rsq)
lendingclub_metrics <-
lendingclub_fit |>
augment(lendingclub_prep) |>
lendingclub_metric_set(truth = INT_RATE, estimate = .pred)
lendingclub_metrics looks like this:
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 4.38
2 mae standard 3.43
3 rsq standard 0.233
The vetiver package provides tools to version, share, deploy, and monitor models.
We’ll use vetiver to version and write our fitted model to a pin on Posit Connect. This will allow us to identify which version of our model is active and track performance against other models over time.
First, we need to connect to a Posit Connect board.
board <- board_connect()
aside negative
To run
rsconnect::board_connect(), you’ll first need to authenticate. To authenticate, navigate toTools>Global Options>Publishing>Connectand follow the instructions.
Then, we create a vetiver model with vetiver_model(),
supplying the function with our fitted model, model name, and metadata
containing our metrics.
model_name <- "interest_rate_prediction"
v <-
vetiver_model(
lendingclub_fit,
model_name,
metadata = list(metrics = lendingclub_metrics)
)
vetiver_pin_write() writes the vetiver model to Posit
Connect as a pin.
board |> vetiver_pin_write(v)
aside negative
If you run into a namespacing error in Connect, or a permissioning error in Snowflake, that may mean someone has already run this code with the same
model_name. You’ll need to pick a different value.
We can use pin_versions() to return all the different
versions of this model.
model_versions <-
board |>
pin_versions(glue("{board$account}/{model_name}"))
aside positive
Unless you’ve done this Quickstart multiple times, you’ll probably only have one model version.
Let’s grab the active version of the model. We’ll use this later when interacting with Snowflake.
model_version <-
model_versions |>
filter(active) |>
pull(version)
At this point, we’ve:
Now, we’re ready to deploy the model with orbital and Snowflake.
Duration: 10
The orbital package allows you to run tidymodels workflow predictions inside databases, including Snowflake, substantially speeding up the prediction process.
To do so, orbital converts tidymodels workflows into SQL that can run on Snowflake. You can then either use that SQL to run the predictions of that model or deploy the model directly to Snowflake.
Let’s convert our tidymodels object into an orbital object with
orbital().
orbital_obj <- orbital(lendingclub_fit)
orbital_obj
── orbital Object ─────────────────────────────────────────────────────────────────────────────────────────────────
• TERM = (TERM == "60 months")
• BC_UTIL = (BC_UTIL - 58.37713) / 27.86179
• BC_OPEN_TO_BUY = (BC_OPEN_TO_BUY - 10906.02) / 16427.05
• ALL_UTIL = (ALL_UTIL - 60.11253) / 20.41787
• BC_OPEN_TO_BUY = dplyr::if_else(is.na(BC_OPEN_TO_BUY), 2.678412e-17, BC_OPEN_TO_BUY)
• BC_UTIL = dplyr::if_else(is.na(BC_UTIL), 2.691542e-17, BC_UTIL)
• .pred = 11.96049 + (ifelse(TERM == "TRUE", 1, 0) * 4.308234) + (BC_UTIL * 0.1679237) + (BC_OPEN_TO_BUY * - ...
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
7 equations in total.
If you want to see the SQL statement generated by the model, run
orbital_sql().
sql_predictor <- orbital_sql(orbital_obj, con)
sql_predictor
<SQL> ("TERM" = '60 months') AS TERM
<SQL> ("BC_UTIL" - 58.3771301356001) / 27.8617927834219 AS BC_UTIL
<SQL> ("BC_OPEN_TO_BUY" - 10906.0174053835) / 16427.0528247826 AS BC_OPEN_TO_BUY
<SQL> ("ALL_UTIL" - 60.1125278283748) / 20.4178686256935 AS ALL_UTIL
<SQL> CASE WHEN (("BC_OPEN_TO_BUY" IS NULL)) THEN 2.67841164255808e-17 WHEN NOT (("BC_OPEN_TO_BUY" IS NULL)) THEN "BC_OPEN_TO_BUY" END AS BC_OPEN_TO_BUY
<SQL> CASE WHEN (("BC_UTIL" IS NULL)) THEN 2.69154231610625e-17 WHEN NOT (("BC_UTIL" IS NULL)) THEN "BC_UTIL" END AS BC_UTIL
<SQL> (((11.9604886671368 + (CASE WHEN ("TERM" = 'TRUE') THEN 1.0 WHEN NOT ("TERM" = 'TRUE') THEN 0.0 END * 4.30823403045859)) + ("BC_UTIL" * 0.167923700948935)) + ("BC_OPEN_TO_BUY" * -0.892522593408621)) + ("ALL_UTIL" * 0.684415441909473) AS .pred
Notice that you can see all the pre-processing steps and model prediction steps.
Now that we’ve converted our tidymodels workflow object to SQL with orbital, we can run model predictions inside Snowflake.
Calling predict() with our orbital object will run the
SQL code we saw above directly in Snowflake.
start_time <- Sys.time()
preds <-
predict(orbital_obj, lendingclub_dat) |>
compute(name = "LENDING_CLUB_PREDICTIONS_TEMP")
end_time <- Sys.time()
preds
# Source: SQL [?? x 1]
# Database: Snowflake 8.44.2[@Snowflake/LENDING_CLUB]
.pred
<dbl>
1 13.6
2 17.0
3 12.5
4 13.6
5 11.9
6 16.8
7 12.3
8 11.3
9 14.0
10 13.5
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
aside positive
We’ve also used
dplyr::compute()to force the query to compute—withoutcompute(),predict()would only be evaluated lazily.compute()saves the results to a temporary table. TheSys.time()calls will help us determine how fast orbital computed our predictions.
To figure out how much orbital sped up our process, let’s see how many predictions we computed:
preds |> count()
# Source: SQL [1 x 1]
# Database: Snowflake 8.44.2[@Snowflake/LENDING_CLUB]
n
<dbl>
1 2260702
as well as how much time elapsed:
end_time - start_time
Time difference of 3.027164 secs
2,260,702 predictions in just 3.02 seconds—thanks to Snowflake and orbital!
Duration: 5
Next, we’ll deploy our model so others can use it. We have a couple of options.
One option is to write the predictions back to Snowflake as a
permanent table by setting temporary = FALSE in
compute():
preds <-
predict(orbital_obj, lendingclub_dat) |>
compute(name = "LENDING_CLUB_PREDICTIONS")
Another is to write our model prediction function as a view.
aside positive
A view allows query results to be accessed like a table. The query will execute anytime the view is called.
To create this view, we first need to construct the SQL query that we want to store in the view.
view_sql <-
lendingclub_dat |>
mutate(!!!orbital_inline(orbital_obj)) |>
select(any_of(c("ID", ".pred"))) |>
remote_query()
orbital_inline() allows us to use our orbital object in
a dplyr pipeline. We can then use select() to select just
the columns we need for our predictions view: ID (to
identify the rows) and .pred (which contains the
predictions).
dbplyr::remote_query() converts our dplyr pipeline back
into SQL. Let’s take a look at the result.
view_sql
<SQL> SELECT
"ID",
(((11.8433051541854 + (CASE WHEN ("TERM" = 'TRUE') THEN 1.0 WHEN NOT ("TERM" = 'TRUE') THEN 0.0 END * 4.38875098270329)) + ("BC_UTIL" * -0.083097744438186)) + ("BC_OPEN_TO_BUY" * -0.894697830285346)) + ("ALL_UTIL" * 0.743982590927869) AS ".pred"
FROM (
SELECT
"ID",
"MEMBER_ID",
"LOAN_AMNT",
"FUNDED_AMNT",
"FUNDED_AMNT_INV",
"TERM",
"INT_RATE",
"INSTALLMENT",
"GRADE",
"SUB_GRADE",
"EMP_TITLE",
"EMP_LENGTH",
"HOME_OWNERSHIP",
"ANNUAL_INC",
"VERIFICATION_STATUS",
"ISSUE_D",
"LOAN_STATUS",
"PYMNT_PLAN",
"URL",
"DESC",
"PURPOSE",
"TITLE",
"ZIP_CODE",
"ADDR_STATE",
"DTI",
"DELINQ_2YRS",
"EARLIEST_CR_LINE",
"FICO_RANGE_LOW",
"FICO_RANGE_HIGH",
"INQ_LAST_6MTHS",
"MTHS_SINCE_LAST_DELINQ",
"MTHS_SINCE_LAST_RECORD",
"OPEN_ACC",
"PUB_REC",
"REVOL_BAL",
"REVOL_UTIL",
"TOTAL_ACC",
"INITIAL_LIST_STATUS",
"OUT_PRNCP",
"OUT_PRNCP_INV",
"TOTAL_PYMNT",
"TOTAL_PYMNT_INV",
"TOTAL_REC_PRNCP",
"TOTAL_REC_INT",
"TOTAL_REC_LATE_FEE",
"RECOVERIES",
"COLLECTION_RECOVERY_FEE",
"LAST_PYMNT_D",
"LAST_PYMNT_AMNT",
"NEXT_PYMNT_D",
"LAST_CREDIT_PULL_D",
"LAST_FICO_RANGE_HIGH",
"LAST_FICO_RANGE_LOW",
"COLLECTIONS_12_MTHS_EX_MED",
"MTHS_SINCE_LAST_MAJOR_DEROG",
"POLICY_CODE",
"APPLICATION_TYPE",
"ANNUAL_INC_JOINT",
"DTI_JOINT",
"VERIFICATION_STATUS_JOINT",
"ACC_NOW_DELINQ",
"TOT_COLL_AMT",
"TOT_CUR_BAL",
"OPEN_ACC_6M",
"OPEN_ACT_IL",
"OPEN_IL_12M",
"OPEN_IL_24M",
"MTHS_SINCE_RCNT_IL",
"TOTAL_BAL_IL",
"IL_UTIL",
"OPEN_RV_12M",
"OPEN_RV_24M",
"MAX_BAL_BC",
"ALL_UTIL",
"TOTAL_REV_HI_LIM",
"INQ_FI",
"TOTAL_CU_TL",
"INQ_LAST_12M",
"ACC_OPEN_PAST_24MTHS",
"AVG_CUR_BAL",
CASE WHEN (("BC_OPEN_TO_BUY" IS NULL)) THEN 2.4581101886311e-17 WHEN NOT (("BC_OPEN_TO_BUY" IS NULL)) THEN "BC_OPEN_TO_BUY" END AS "BC_OPEN_TO_BUY",
CASE WHEN (("BC_UTIL" IS NULL)) THEN 9.95071097045741e-17 WHEN NOT (("BC_UTIL" IS NULL)) THEN "BC_UTIL" END AS "BC_UTIL",
"CHARGEOFF_WITHIN_12_MTHS",
"DELINQ_AMNT",
"MO_SIN_OLD_IL_ACCT",
"MO_SIN_OLD_REV_TL_OP",
"MO_SIN_RCNT_REV_TL_OP",
"MO_SIN_RCNT_TL",
"MORT_ACC",
"MTHS_SINCE_RECENT_BC",
"MTHS_SINCE_RECENT_BC_DLQ",
"MTHS_SINCE_RECENT_INQ",
"MTHS_SINCE_RECENT_REVOL_DELINQ",
"NUM_ACCTS_EVER_120_PD",
"NUM_ACTV_BC_TL",
"NUM_ACTV_REV_TL",
"NUM_BC_SATS",
"NUM_BC_TL",
"NUM_IL_TL",
"NUM_OP_REV_TL",
"NUM_REV_ACCTS",
"NUM_REV_TL_BAL_GT_0",
"NUM_SATS",
"NUM_TL_120DPD_2M",
"NUM_TL_30DPD",
"NUM_TL_90G_DPD_24M",
"NUM_TL_OP_PAST_12M",
"PCT_TL_NVR_DLQ",
"PERCENT_BC_GT_75",
"PUB_REC_BANKRUPTCIES",
"TAX_LIENS",
"TOT_HI_CRED_LIM",
"TOTAL_BAL_EX_MORT",
"TOTAL_BC_LIMIT",
"TOTAL_IL_HIGH_CREDIT_LIMIT",
"REVOL_BAL_JOINT",
"SEC_APP_FICO_RANGE_LOW",
"SEC_APP_FICO_RANGE_HIGH",
"SEC_APP_EARLIEST_CR_LINE",
"SEC_APP_INQ_LAST_6MTHS",
"SEC_APP_MORT_ACC",
"SEC_APP_OPEN_ACC",
"SEC_APP_REVOL_UTIL",
"SEC_APP_OPEN_ACT_IL",
"SEC_APP_NUM_REV_ACCTS",
"SEC_APP_CHARGEOFF_WITHIN_12_MTHS",
"SEC_APP_COLLECTIONS_12_MTHS_EX_MED",
"SEC_APP_MTHS_SINCE_LAST_MAJOR_DEROG",
"HARDSHIP_FLAG",
"HARDSHIP_TYPE",
"HARDSHIP_REASON",
"HARDSHIP_STATUS",
"DEFERRAL_TERM",
"HARDSHIP_AMOUNT",
"HARDSHIP_START_DATE",
"HARDSHIP_END_DATE",
"PAYMENT_PLAN_START_DATE",
"HARDSHIP_LENGTH",
"HARDSHIP_DPD",
"HARDSHIP_LOAN_STATUS",
"ORIG_PROJECTED_ADDITIONAL_ACCRUED_INTEREST",
"HARDSHIP_PAYOFF_BALANCE_AMOUNT",
"HARDSHIP_LAST_PAYMENT_AMOUNT",
"DISBURSEMENT_METHOD",
"DEBT_SETTLEMENT_FLAG",
"DEBT_SETTLEMENT_FLAG_DATE",
"SETTLEMENT_STATUS",
"SETTLEMENT_DATE",
"SETTLEMENT_AMOUNT",
"SETTLEMENT_PERCENTAGE",
"SETTLEMENT_TERM",
"ISSUE_YEAR",
"ISSUE_MONTH"
FROM (
SELECT
"ID",
"MEMBER_ID",
"LOAN_AMNT",
"FUNDED_AMNT",
"FUNDED_AMNT_INV",
("TERM" = '60 months') AS "TERM",
"INT_RATE",
"INSTALLMENT",
"GRADE",
"SUB_GRADE",
"EMP_TITLE",
"EMP_LENGTH",
"HOME_OWNERSHIP",
"ANNUAL_INC",
"VERIFICATION_STATUS",
"ISSUE_D",
"LOAN_STATUS",
"PYMNT_PLAN",
"URL",
"DESC",
"PURPOSE",
"TITLE",
"ZIP_CODE",
"ADDR_STATE",
"DTI",
"DELINQ_2YRS",
"EARLIEST_CR_LINE",
"FICO_RANGE_LOW",
"FICO_RANGE_HIGH",
"INQ_LAST_6MTHS",
"MTHS_SINCE_LAST_DELINQ",
"MTHS_SINCE_LAST_RECORD",
"OPEN_ACC",
"PUB_REC",
"REVOL_BAL",
"REVOL_UTIL",
"TOTAL_ACC",
"INITIAL_LIST_STATUS",
"OUT_PRNCP",
"OUT_PRNCP_INV",
"TOTAL_PYMNT",
"TOTAL_PYMNT_INV",
"TOTAL_REC_PRNCP",
"TOTAL_REC_INT",
"TOTAL_REC_LATE_FEE",
"RECOVERIES",
"COLLECTION_RECOVERY_FEE",
"LAST_PYMNT_D",
"LAST_PYMNT_AMNT",
"NEXT_PYMNT_D",
"LAST_CREDIT_PULL_D",
"LAST_FICO_RANGE_HIGH",
"LAST_FICO_RANGE_LOW",
"COLLECTIONS_12_MTHS_EX_MED",
"MTHS_SINCE_LAST_MAJOR_DEROG",
"POLICY_CODE",
"APPLICATION_TYPE",
"ANNUAL_INC_JOINT",
"DTI_JOINT",
"VERIFICATION_STATUS_JOINT",
"ACC_NOW_DELINQ",
"TOT_COLL_AMT",
"TOT_CUR_BAL",
"OPEN_ACC_6M",
"OPEN_ACT_IL",
"OPEN_IL_12M",
"OPEN_IL_24M",
"MTHS_SINCE_RCNT_IL",
"TOTAL_BAL_IL",
"IL_UTIL",
"OPEN_RV_12M",
"OPEN_RV_24M",
"MAX_BAL_BC",
("ALL_UTIL" - 59.8336032388664) / 20.4722401880186 AS "ALL_UTIL",
"TOTAL_REV_HI_LIM",
"INQ_FI",
"TOTAL_CU_TL",
"INQ_LAST_12M",
"ACC_OPEN_PAST_24MTHS",
"AVG_CUR_BAL",
("BC_OPEN_TO_BUY" - 10724.3929149798) / 15681.5732122865 AS "BC_OPEN_TO_BUY",
("BC_UTIL" - 58.3194129554656) / 28.1641014108481 AS "BC_UTIL",
"CHARGEOFF_WITHIN_12_MTHS",
"DELINQ_AMNT",
"MO_SIN_OLD_IL_ACCT",
"MO_SIN_OLD_REV_TL_OP",
"MO_SIN_RCNT_REV_TL_OP",
"MO_SIN_RCNT_TL",
"MORT_ACC",
"MTHS_SINCE_RECENT_BC",
"MTHS_SINCE_RECENT_BC_DLQ",
"MTHS_SINCE_RECENT_INQ",
"MTHS_SINCE_RECENT_REVOL_DELINQ",
"NUM_ACCTS_EVER_120_PD",
"NUM_ACTV_BC_TL",
"NUM_ACTV_REV_TL",
"NUM_BC_SATS",
"NUM_BC_TL",
"NUM_IL_TL",
"NUM_OP_REV_TL",
"NUM_REV_ACCTS",
"NUM_REV_TL_BAL_GT_0",
"NUM_SATS",
"NUM_TL_120DPD_2M",
"NUM_TL_30DPD",
"NUM_TL_90G_DPD_24M",
"NUM_TL_OP_PAST_12M",
"PCT_TL_NVR_DLQ",
"PERCENT_BC_GT_75",
"PUB_REC_BANKRUPTCIES",
"TAX_LIENS",
"TOT_HI_CRED_LIM",
"TOTAL_BAL_EX_MORT",
"TOTAL_BC_LIMIT",
"TOTAL_IL_HIGH_CREDIT_LIMIT",
"REVOL_BAL_JOINT",
"SEC_APP_FICO_RANGE_LOW",
"SEC_APP_FICO_RANGE_HIGH",
"SEC_APP_EARLIEST_CR_LINE",
"SEC_APP_INQ_LAST_6MTHS",
"SEC_APP_MORT_ACC",
"SEC_APP_OPEN_ACC",
"SEC_APP_REVOL_UTIL",
"SEC_APP_OPEN_ACT_IL",
"SEC_APP_NUM_REV_ACCTS",
"SEC_APP_CHARGEOFF_WITHIN_12_MTHS",
"SEC_APP_COLLECTIONS_12_MTHS_EX_MED",
"SEC_APP_MTHS_SINCE_LAST_MAJOR_DEROG",
"HARDSHIP_FLAG",
"HARDSHIP_TYPE",
"HARDSHIP_REASON",
"HARDSHIP_STATUS",
"DEFERRAL_TERM",
"HARDSHIP_AMOUNT",
"HARDSHIP_START_DATE",
"HARDSHIP_END_DATE",
"PAYMENT_PLAN_START_DATE",
"HARDSHIP_LENGTH",
"HARDSHIP_DPD",
"HARDSHIP_LOAN_STATUS",
"ORIG_PROJECTED_ADDITIONAL_ACCRUED_INTEREST",
"HARDSHIP_PAYOFF_BALANCE_AMOUNT",
"HARDSHIP_LAST_PAYMENT_AMOUNT",
"DISBURSEMENT_METHOD",
"DEBT_SETTLEMENT_FLAG",
"DEBT_SETTLEMENT_FLAG_DATE",
"SETTLEMENT_STATUS",
"SETTLEMENT_DATE",
"SETTLEMENT_AMOUNT",
"SETTLEMENT_PERCENTAGE",
"SETTLEMENT_TERM",
"ISSUE_YEAR",
"ISSUE_MONTH"
FROM (
SELECT
"LOAN_DATA".*,
CAST(SUBSTR("ISSUE_D", 5) AS INT) AS "ISSUE_YEAR",
CAST(CASE
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Jan')) THEN 1.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Feb')) THEN 2.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Mar')) THEN 3.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Apr')) THEN 4.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('May')) THEN 5.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Jun')) THEN 6.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Jul')) THEN 7.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Aug')) THEN 8.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Sep')) THEN 9.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Oct')) THEN 10.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Nov')) THEN 11.0
WHEN (SUBSTR("ISSUE_D", 1, 3) IN ('Dec')) THEN 12.0
END AS INT) AS "ISSUE_MONTH"
FROM "LOAN_DATA"
) "q01"
) "q01"
) "q01"
This SQL query uses our model to compute a predicted value for each
loan in the LOAN_DATA table. It creates a table with two
columns, ID and .PRED, which are all we need
when creating a table of predictions. Later, if someone wants to use our
prediction table, they can use the ID column to join the
predictions back to the loan data.
Now, we’ll take this SQL and create a view. First, we create a name for our view by glueing together the model name and version.
versioned_view_name <- glue("{model_name}_v{model_version}")
Then, we use glue_sql() to combine the SQL needed to
create a view with our model prediction SQL from above.
snowflake_view_statement <-
glue_sql(
"CREATE OR REPLACE VIEW {`versioned_view_name`} AS ",
view_sql,
.con = con
)
Finally, we execute this complete SQL query with
DBI::dbExecute().
con |>
DBI::dbExecute(snowflake_view_statement)
[1] 0
aside positive
dbExecute()returned0because it returns the number of rows changed. We created a new view, so we changed0rows.
Our view is now in Snowflake and ready to use!
So far, we’ve only created one view for our one model version. But what if you have multiple views corresponding to multiple versions of your model? It would be helpful to have a view that always corresponds to the latest version of the model. Let’s make that view.
First, we’ll make a name for this new view.
main_view_name <- glue::glue("{model_name}_latest")
Then, we again use glue_sql() and
dbExecute() to create the view.
main_view_statement <- glue::glue_sql(
"CREATE OR REPLACE VIEW {`main_view_name`} AS ",
"SELECT * FROM {`versioned_view_name`}",
.con = con
)
con |>
DBI::dbExecute(main_view_statement)
We can take a look at this view using tbl() and
collect() to pull the view into R.
con |>
tbl(main_view_name) |>
head(100) |>
collect()
# A tibble: 100 × 2
ID .pred
<chr> <dbl>
1 142212962 16.7
2 142529846 12.4
3 141649517 13.7
4 141671381 13.1
5 141449070 16.9
6 142661219 11.6
7 142647391 15.9
8 142646325 8.37
9 142211815 9.98
10 142661174 11.1
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows
Now, we have a table of predictions we can share with others in our organization through Snowflake. They can use our view no matter what tools they’re using to interact with Snowflake, including Snowsight, as shown below.
Duration: 1
As new data comes in over time, it would be useful to be able to refit our model. We can refit periodically, monitor performance, and store the best-performing version on Posit Connect. We won’t cover this in detail here, but we’ve put together sample code for the process in this Quarto document.
This blog post also covers the process.
Duration: 2
In this Quickstart, you learned how to use orbital and tidymodels to build, deploy, and run predictions directly in Snowflake. By combining the power of Snowflake with the flexibility of R, you can efficiently scale your modeling workflows and share results as views, all within a secure environment.